feat: add --dry-run estimation mode #1592

Open
mvanhorn wants to merge 3 commits into intel:main from mvanhorn:osc/feat-dry-run

Conversation

@mvanhorn

Summary

Adds a --dry-run flag to the CLI that estimates VRAM usage, output file size, and approximate quantization time without running the full quantization process.

  • Loads only the model config via AutoConfig.from_pretrained() (no weights downloaded)
  • Estimates peak VRAM from parameter count, dtype, batch size, and sequence length
  • Estimates output file size from target bit width, parameter count, and group size overhead
  • Estimates time from layer count, iterations, and calibration batch count
  • Prints a formatted summary table and exits
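The VRAM arithmetic behind an estimate like this can be sketched as below. The function name and constants are illustrative, not the actual auto_round/estimation.py API:

```python
def estimate_peak_vram_gb(param_count: int, dtype_bytes: int = 2,
                          batch_size: int = 8, seqlen: int = 2048,
                          hidden_size: int = 4096) -> float:
    """Very rough peak-VRAM estimate: full model resident in memory, plus
    activations for one tuned block, plus a ~20% CUDA overhead buffer."""
    model_bytes = param_count * dtype_bytes
    # activations for one decoder block: a handful of (batch, seqlen, hidden)
    # tensors; the factor of 4 is a coarse stand-in for attention/MLP temps
    act_bytes = batch_size * seqlen * hidden_size * dtype_bytes * 4
    return (model_bytes + act_bytes) * 1.2 / 1e9
```

For a 6.61B-parameter fp16 model with the example settings this lands in the mid-to-high teens of GB, the same ballpark as the example output in this PR.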

Motivation

Users quantizing large models (70B+) need to know resource requirements before committing GPU hours. This is relevant to #1551 (reduce quant cost) and #1584 (peak VRAM tracking).

Example output

============================================================
  AutoRound Dry-Run Estimation
============================================================
  Model:              meta-llama/Llama-2-7b-hf
  Parameters:         6.61B
  Layers:             32
  Target bits:        4
  Group size:         128
  Model dtype:        float16
============================================================
  Estimated peak VRAM:    17.80 GB
  Estimated output size:  3.64 GB
  Estimated time:         3.4 hours
    (batch_size=8, seqlen=2048, nsamples=128, iters=200)
============================================================
  NOTE: These are rough estimates. Actual values depend on
  hardware, model architecture, and runtime conditions.
============================================================
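The 3.4-hour figure is consistent with a per-layer, per-iteration heuristic scaled by the calibration batch count. This reconstruction is inferred from the printed numbers and the description above, not quoted from the module:

```python
SECS_PER_LAYER_PER_ITER = 0.12  # A100 / 7B-class heuristic quoted in the PR

def estimate_hours(num_layers: int, iters: int,
                   nsamples: int, batch_size: int) -> float:
    batches_per_iter = nsamples / batch_size   # 128 / 8 = 16
    total_secs = (num_layers * iters * batches_per_iter
                  * SECS_PER_LAYER_PER_ITER)
    return total_secs / 3600

# 32 layers * 200 iters * 16 batches * 0.12 s ≈ 12288 s ≈ 3.4 hours
```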

Changes

  • New: auto_round/estimation.py - VRAM, disk, and time estimation functions
  • Modified: auto_round/__main__.py - --dry_run / --dry-run CLI flag, short-circuits before model loading
  • New: test/test_cpu/core/test_estimation.py - unit tests for all estimation functions

Testing

All estimation unit tests pass (parameter counting, VRAM estimation, output size calculation, time estimation, format helpers). Tests use stub configs to avoid model downloads.

Fixes #1591

This contribution was developed with AI assistance (Claude Code).

mvanhorn and others added 2 commits March 21, 2026 23:05
Add a --dry-run flag to the CLI that estimates VRAM usage, output file
size, and approximate quantization time without running the full
quantization process. Uses AutoConfig to load model architecture
metadata without downloading weights.

New module: auto_round/estimation.py with estimation functions for
parameter count, peak VRAM, output size, and time.

Relates to intel#1551 and intel#1584
Fixes intel#1591

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Refactor _count_parameters into smaller helpers to reduce local
variable count. Convert dry_run_estimate to use **kwargs and
extract helpers for config loading and result building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@wenhuach21 wenhuach21 requested review from n1ck-guo and xin3he March 22, 2026 07:37
hidden_size^2 * num_layers heuristic when fields are missing.
"""
hidden = getattr(config, "hidden_size", None)
num_layers = getattr(config, "num_hidden_layers", None)
@wenhuach21 (Contributor):

We typically perform block-wise tuning. By a "block" we mean a decoder layer, which for non-MoE models usually contains 6–7 linear layers.
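For a LLaMA-style (non-MoE) decoder block, those 6–7 linears are the four attention projections plus the three MLP projections. A sketch of the per-block parameter count under that assumption (ignoring GQA, biases, and norm weights):

```python
# Linear submodules in one LLaMA-style decoder block
BLOCK_LINEARS = ["q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj"]   # 7 linears

def params_per_block(hidden_size: int, intermediate_size: int) -> int:
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    return attn + mlp

# Llama-2-7B (hidden 4096, intermediate 11008):
# 32 * params_per_block(4096, 11008) ≈ 6.48B, close to the 6.61B total
# once embeddings are added.
```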

- CUDA overhead and fragmentation (~20% buffer)
"""
# Model weights
model_bytes = param_count * model_dtype_bytes
@wenhuach21 (Contributor):

We need to cache some input data for the block when "low_gpu_mem_usage" is not enabled.
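A sketch of the extra memory term this implies, assuming all calibration samples' hidden-state inputs to the current block stay resident on device (names are illustrative):

```python
def cached_block_input_gb(nsamples: int, seqlen: int,
                          hidden_size: int, dtype_bytes: int = 2) -> float:
    """Memory to cache every calibration sample's input to the current
    block when low_gpu_mem_usage is disabled (rough; hidden states only)."""
    return nsamples * seqlen * hidden_size * dtype_bytes / 1e9

# With the example settings (128 samples, 2048 tokens, 4096 hidden, fp16)
# this is about 2.1 GB on top of the model weights.
```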


# Rough seconds per layer per iteration, measured on A100 for a 7B-class model.
# Actual speed varies widely by hardware and model architecture.
_SECS_PER_LAYER_PER_ITER = 0.12
@xin3he (Contributor):

Can we use a dummy block to measure the real performance of the current machine?
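One way to act on this suggestion: time a stand-in step on the current machine instead of relying on the hard-coded 0.12 s constant. The harness below is a generic sketch; the real version would run forward/backward on a dummy decoder block, which is stubbed out here:

```python
import time

def benchmark_step(step_fn, warmup: int = 2, reps: int = 5) -> float:
    """Return mean wall-clock seconds per call of step_fn on this machine."""
    for _ in range(warmup):          # warm caches / CUDA context / JIT
        step_fn()
    start = time.perf_counter()
    for _ in range(reps):
        step_fn()
    return (time.perf_counter() - start) / reps

# Hypothetical usage — run_one_dummy_block_iteration is not a real function:
# measured = benchmark_step(run_one_dummy_block_iteration)
# est_secs = measured * num_layers * iters * calib_batches
```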


# Optimizer state: roughly 2x one block's parameters (momentum + variance for Adam)
# Approximate one block as total_params / num_layers
block_overhead = model_bytes * 0.05 # ~5% of model for one block's optimizer state
@xin3he (Contributor):

card_0_used_memory = block_input_output_memory + layer_activation_memory + additional_memory

I have summarized the key points regarding block_overhead here, and I hope this proves insightful for you.
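Following the Adam-state reasoning in the quoted snippet (and as an alternative to its flat 5% heuristic), one block's optimizer overhead can be sketched as two fp32 state tensors per tuned parameter. Treating a whole block's weights as tuned gives a rough upper bound:

```python
def block_optimizer_state_gb(total_params: int, num_layers: int,
                             state_dtype_bytes: int = 4) -> float:
    """Upper-bound Adam state (exp_avg + exp_avg_sq, fp32) for one block,
    approximating one block as total_params / num_layers."""
    block_params = total_params / num_layers
    return 2 * block_params * state_dtype_bytes / 1e9

# Llama-2-7B: 6.61e9 / 32 params per block -> about 1.65 GB of Adam state.
# AutoRound tunes far fewer parameters than a full block, so the real
# number should be well below this bound.
```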

hidden_size^2 * num_layers heuristic when fields are missing.
"""
hidden = getattr(config, "hidden_size", None)
num_layers = getattr(config, "num_hidden_layers", None)
@xin3he (Contributor):

num_hidden_layers may not cover many model configs; Claude could help refine it.
By the way, we may need special handling for MoE models.
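A sketch of a more tolerant lookup that tries the layer-count field names used by common Hugging Face config classes. The list is illustrative, and MoE specifics (e.g. expert counts) are not handled:

```python
_LAYER_COUNT_FIELDS = (
    "num_hidden_layers",  # llama, mistral, qwen2, ...
    "n_layer",            # gpt2
    "n_layers",           # mpt
    "num_layers",         # chatglm, t5
)

def get_num_layers(config):
    """Try common HF config field names in turn; None if none match."""
    for name in _LAYER_COUNT_FIELDS:
        value = getattr(config, name, None)
        if isinstance(value, int):
            return value
    return None  # caller falls back to a heuristic or raises
```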

@mvanhorn (Author):

Thanks for the detailed feedback on the estimation approach.

@wenhuach21 Good point on block-wise tuning and the input caching overhead. I'll update the estimation to account for per-block input/output caching when low_gpu_mem_usage is disabled.

@xin3he The dummy block idea for real machine benchmarking is interesting - that would give more accurate estimates than extrapolation. I'll look into the block_overhead breakdown you linked and refine the estimation to handle MoE models separately. The num_hidden_layers limitation is a fair point.

